Data Compression with Prime Numbers 



Gordon Chalmers 

e-mail: chalmers@quartz.sliango.com 

Abstract 

A compression algorithm is presented that uses the set of prime numbers. 
Sequences of numbers are correlated with the prime numbers and labeled with the 
integers. The algorithm can be iterated on data sets, generating factors of two 
in the compression. 



Tens of millions of prime numbers are publicly available. The bit complexity 
of these sets of numbers can be reduced exponentially by associating an integer count 
to the sets of these numbers. Because there are approximately $N/\ln(N)$ prime 
numbers below a number $N$, the complexity of the data set is reduced by an 
approximate factor of 

$$\ln(N). \tag{1}$$


This reduction in the complexity of the prime number data set can be used to 
compress any set of integers as well. By breaking the number up into bit sequences 
which are prime, and then labeling the numbers with the integers, a string of bits can 
be reduced in complexity by a factor of $\ln(N)$ of the individual number sequences. 
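
As a rough numerical check of this counting argument, the following Python sketch sieves the primes below a limit and compares the exact count with N/ln(N), together with the bits needed for a value near the limit and the bits needed for its index. The helper names (`sieve`, `index_vs_value_bits`) and the chosen limit are illustrative assumptions, not part of the algorithm as stated.

```python
import math

def sieve(limit):
    """Return all primes <= limit using the sieve of Eratosthenes."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            # Clear every multiple of p starting at p*p.
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return [n for n, is_p in enumerate(flags) if is_p]

def index_vs_value_bits(limit):
    """Return (prime count, N/ln N estimate, bits of N, bits of the count)."""
    count = len(sieve(limit))
    estimate = limit / math.log(limit)
    return count, estimate, math.log2(limit), math.log2(count)

if __name__ == "__main__":
    count, estimate, value_bits, index_bits = index_vs_value_bits(10**6)
    print(f"primes below 10^6: {count}, N/ln(N): {estimate:.0f}")
    print(f"bits for a value: {value_bits:.1f}, bits for an index: {index_bits:.1f}")
```

The gap between the two bit counts is log2 of the ratio N/pi(N), which grows only logarithmically with N.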

For example, consider the number $N = 101113$. This number contains the prime 
sequences 101 and 113, which are the 26th and 30th prime numbers. The two numbers 
could be registered by their indices, rather than the number sequences contained in 
$N$. The bit complexity is reduced to four digits, the 26 and 30, rather than the 6 
digits of 101113. This reduction is not much for small numbers, but can be larger 
for prime number subsequences with ten or tens of digits. The relevant ratio is the 
fraction of numbers below a number $N$ that are prime, which decreases as the 
number $N$ increases. 
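
The index lookup in this example can be sketched in a few lines of Python. The function names and the small sieve bound are assumptions made for illustration; a real front end would query a stored prime table instead. Run on 101 and 113, the lookup gives indices 26 and 30.

```python
from bisect import bisect_left

def primes_up_to(limit):
    """Return the sorted list of primes <= limit (simple sieve)."""
    flags = [True] * (limit + 1)
    flags[0] = flags[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if flags[p]:
            for multiple in range(p * p, limit + 1, p):
                flags[multiple] = False
    return [n for n, is_p in enumerate(flags) if is_p]

def prime_index(p, primes):
    """Return the 1-based index of prime p in the sorted list primes."""
    pos = bisect_left(primes, p)
    if pos == len(primes) or primes[pos] != p:
        raise ValueError(f"{p} is not a prime in the table")
    return pos + 1

if __name__ == "__main__":
    table = primes_up_to(200)
    for p in (101, 113):
        print(p, "->", prime_index(p, table))
```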

The natural question is: given a sequence of digits, what is the probability of 
finding a prime number contained in a subsequence? Clearly, a number composed only 
of even digits would fail this test, but real datasets aren't expected to be 
composed of only even digits. 

The chance of a random number $N$ being prime is approximately $1/\ln(N)$, 
which scales as the inverse of the number of digits. Checking a subsequence of a number 
with $N_d$ digits requires summing the probabilities. Checking the subsequences of a 
number with $P_d$ digits to $Q_d$ digits requires the sum 

$$\sum_{d=P_d}^{Q_d} \frac{1}{d} = \ln(Q_d/P_d) + c + O(P_d/Q_d), \tag{2}$$


which is always greater than unity if the numbers are chosen in a certain manner. 
Clearly, if $Q_d$ and $P_d$ are chosen with a large ratio, the probability will 
eventually be unity. 

The finite sums 

$$\frac{1}{6} + \frac{1}{7} + \cdots + \frac{1}{13} > 0.89 \tag{3}$$

are almost unity. The probability of finding a prime subsequence of between 6 and 
13 digits in a number containing more than 13 digits is approximately unity. There 
are public databases of the first 15 million prime numbers, which consist of up to 
nine digits. 
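
The sum of inverse digit counts for subsequences of 6 to 13 digits can be evaluated directly. The Python sketch below does so with exact fractions and prints the leading logarithmic term ln(Q_d/P_d) alongside it. The function name `inverse_digit_sum` is hypothetical. The sketch also reports the same sum using the prime number theorem density 1/ln(10^d) = 1/(d ln 10), which is smaller by the constant factor ln 10; that second figure is added here for comparison and is not a value from the text.

```python
import math
from fractions import Fraction

def inverse_digit_sum(pd, qd):
    """Exact value of 1/pd + 1/(pd+1) + ... + 1/qd."""
    return sum(Fraction(1, d) for d in range(pd, qd + 1))

if __name__ == "__main__":
    pd, qd = 6, 13
    total = float(inverse_digit_sum(pd, qd))
    print(f"sum of 1/d for d = {pd}..{qd}: {total:.4f}")
    print(f"leading term ln(Qd/Pd):       {math.log(qd / pd):.4f}")
    # Comparison only: density 1/ln(10^d) = 1/(d ln 10) from the prime number theorem.
    print(f"sum of 1/(d ln 10):           {total / math.log(10):.4f}")
```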

The examples in (3) indicate that the public databases could be used to break 
up numbers into sequences of numbers which are prime. The bit complexity in the 
reduction depends on the size of the sequences. The bit complexity of a prime number 
of the size of $10^9$ is 30 and that of its index, a number of the order $10^7$, is 23; 
the ratio is a naive estimate of the 'worst' case scenario of using sets of 23 bits to 
label the prime number index versus the actual bit complexity of a nine-digit number. 
This ratio is 1.3, which signals a 30 percent compression factor. An interesting aspect 
is that this algorithm can be iterated multiple times, with a 30 percent factor in each 
iteration (three iterations give a factor of 2.2). 
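
The arithmetic behind these figures can be reproduced with a short Python sketch. It takes the bit complexity of a number n to be log2(n), which is an assumption about how the 30 and 23 were obtained, and the helper names are illustrative.

```python
import math

def bits(n):
    """Bit complexity of n, taken here as the base-2 logarithm."""
    return math.log2(n)

def iterated_factor(per_pass, passes):
    """Overall factor if each pass multiplies the compression by per_pass."""
    return per_pass ** passes

if __name__ == "__main__":
    value_bits = bits(10**9)   # nine-digit prime, close to 30 bits
    index_bits = bits(10**7)   # index of order 10^7, close to 23 bits
    print(f"value: {value_bits:.1f} bits, index: {index_bits:.1f} bits")
    per_pass = round(30 / 23, 1)  # rounded figures quoted in the text
    print(f"ratio per pass: {per_pass}")
    print(f"three passes:   {iterated_factor(per_pass, 3):.1f}")
```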

A more efficient algorithm is to use two units. The first states how many digits 
are in the prime label, from 4 to 7, which takes two bits. The second number specifies 
the prime number index. The advantage of this is that not all prime numbers have 9 
digits (out of the example range of five to nine digits). An index with four digits 
requires 13 bits and an index with seven digits requires 23 bits. This should increase 
the compression factor to almost two, considering the distribution and probability of 
finding the prime number sequences. (This version is similar to minimizing the bit 
vacancies in a byte, where a series of bits is not required to specify the 'color' of 
the data in certain compression schemes [1].) 
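
One possible reading of the two-unit scheme is sketched below in Python. The layout is an assumption: a 2-bit header maps to an index length of 4 to 7 decimal digits, followed by a fixed-width binary payload wide enough for any index of that length. With widths rounded up to whole bits, the payloads come to 14 and 24 bits for four and seven digits, slightly above the 13 and 23 quoted above. Indices shorter than four digits are padded into the four-digit class, and all names are hypothetical.

```python
MIN_DIGITS, MAX_DIGITS = 4, 7  # index lengths covered by the 2-bit header

def payload_width(digits):
    """Whole bits needed for the largest index with this many decimal digits."""
    return (10 ** digits - 1).bit_length()

def encode_index(index):
    """Encode a prime index as a 2-bit length header plus a binary payload."""
    digits = max(MIN_DIGITS, len(str(index)))
    if index < 1 or digits > MAX_DIGITS:
        raise ValueError("index outside the range covered by the header")
    header = format(digits - MIN_DIGITS, "02b")
    return header + format(index, f"0{payload_width(digits)}b")

def decode_index(bits):
    """Decode one unit; return (index, number of bits consumed)."""
    digits = int(bits[:2], 2) + MIN_DIGITS
    width = payload_width(digits)
    return int(bits[2:2 + width], 2), 2 + width

if __name__ == "__main__":
    for index in (26, 30, 9_999_999):
        code = encode_index(index)
        print(index, "->", code, f"({len(code)} bits)")
```

Because each unit carries its own length, several encoded indices can be concatenated and decoded one after another using the consumed-bit count returned by `decode_index`.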

The examples listed pertain to prime numbers with up to 9 digits, that is, up to one 
billion. Asymptotically, the larger the number of digits in the prime number, the larger 
the compression factor will be. Scanning for sequences with $P_d$ to $Q_d$ digits, with 
large numbers of digits, is computationally intensive, but this will lead to larger 
compression factors. There is no bound to the compression factor given the distribution 
of primes $N/\ln(N)$; this says something about the entropy of the information. 
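
The scan for prime subsequences of $P_d$ to $Q_d$ digits might look like the following Python sketch. It swaps the database lookup for a Miller-Rabin primality test with a fixed set of bases, which is deterministic for the number sizes considered here; that substitution, the rule of skipping substrings with a leading zero, and the function names are all assumptions made for illustration.

```python
SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

def is_prime(n):
    """Miller-Rabin test; these bases are deterministic well beyond 64 bits."""
    if n < 2:
        return False
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in SMALL_PRIMES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True

def prime_substrings(digits, pd, qd):
    """Yield (start, length, value) for each prime substring of pd..qd digits."""
    for start in range(len(digits)):
        if digits[start] == "0":
            continue  # assumption: skip substrings with a leading zero
        for length in range(pd, qd + 1):
            if start + length > len(digits):
                break
            value = int(digits[start:start + length])
            if is_prime(value):
                yield start, length, value

if __name__ == "__main__":
    print(list(prime_substrings("101113", 3, 3)))
```

The scan performs on the order of len(digits) times (Q_d - P_d + 1) primality tests, which is the computational cost referred to above.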

The iteration of the algorithm could easily produce compression factors of ten 
or so, given a front end for searching the prime sequences of the number $N$. A 
database of the first 15 million primes would require a gigabyte of storage. Also, this 
compression algorithm can be combined with existing algorithms. 